library(tidyverse)
library(tidylog)
library(here)
library(janitor)
library(gt)
library(scales)
library(skimr)
library(readxl)
library(tidycensus)
library(tigris)
library(sf)
library(FNN)
library(ggpubr)
library(gridExtra)
library(tidymodels)
library(snakecase)
library(mapview)
library(leaflet)
library(leafpop)
mapviewOptions(basemaps.color.shuffle = FALSE)

project_path <- here()

theme_set(theme_light())

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

# Note for grader:
# Because of the large size of the data and the correspondingly long evaluation times, we read in
# prepared data and modelling output objects that we wrote out earlier using the code shown in this RMD.
gt_print <- function(data) {
  
  data %>% 
    gt() %>% 
    cols_label_with(fn = ~ to_sentence_case(.)) %>% 
    tab_style(cell_text(weight = "bold"), 
              locations = list(cells_column_labels(), cells_row_groups())) %>% 
    fmt_number(where(is.numeric), decimals = 0) %>% 
    fmt_percent(contains("percent"), decimals = 1) 
  
}

Making Predictions with Growhere

When launching a business, choosing the right location is crucial to success, whether for a first-time small-business owner or a seasoned entrepreneur. Researching which parts of the city have seen growth in recent years might give an accurate enough picture of in-demand locations, but it does not explain what it is about a location that attracts growth. Without training in geospatial data science, a busy entrepreneur may not have all the information needed to make the best-informed decision. That’s where Growhere comes in.

# From https://opendataphilly.org/datasets/licenses-and-inspections-building-and-zoning-permits/
data_permits_raw <- 
  read_sf("https://phl.carto.com/api/v2/sql?filename=permits&format=geojson&skipfields=cartodb_id&q=SELECT+*+FROM%20permits%20WHERE%20permitissuedate%20%3E=%20%272014-01-01%27") 

# From https://opendataphilly.org/datasets/philadelphia-properties-and-assessment-history/
data_opa_raw <-
  read_sf("https://phl.carto.com/api/v2/sql?filename=opa_properties_public&format=geojson&skipfields=cartodb_id&q=SELECT+cartodb_id,the_geom,assessment_date,building_code,building_code_description,category_code,census_tract,exempt_building,exempt_land,exterior_condition,general_construction,geographic_ward,homestead_exemption,location,mailing_zip,market_value,market_value_date,number_of_rooms,number_stories,parcel_number,parcel_shape,quality_grade,recording_date,registry_number,sale_date,sale_price,street_code,street_designation,street_name,taxable_building,taxable_land,topography,total_area,total_livable_area,unfinished,view_type,year_built,zip_code,zoning,pin,building_code_new,building_code_description_new,objectid+FROM+opa_properties_public")
relevant_permits <- data_permits_raw %>% 
  st_drop_geometry() %>% 
  filter(status %in% c("COMPLETED", "EXPIRED", "ISSUED", "READY FOR ISSUE")) %>% 
  filter(str_detect(permittype, "NEWCNST") | 
           str_detect(permitdescription, "NEW CONSTRUCTION") | 
           str_detect(typeofwork, "NEW CONSTRUCTION") |
           (permittype == "BP_ALTER" & typeofwork == "MAJOR")) %>% 
  mutate(permit_year = year(permitissuedate)) %>% 
  arrange(opa_account_num, permit_year) %>% 
  # Reduce to parcel level
  summarize(.by = opa_account_num,
            permit_years = str_flatten(unique(permit_year), "; "))

Philadelphia is a growing, diverse city with ample opportunities for business owners to tap into new markets. Growhere’s mission is to make the process of selecting a location a little easier by predicting the number of properties that will receive a permit for new construction or major alterations to existing buildings, our metric for development. Behind these predictions is a model built on property-level data and amenity/disamenity data. What the user sees, though, is an interactive map displaying the predicted number of properties that will be issued permits in each cell. Our insights take the guesswork out of choosing where to start a business, helping more businesses succeed and motivating growth and innovation in our city.

opa_permits <- data_opa_raw %>% 
  # Transform to projected coordinates
  st_transform("EPSG:6565") %>% 
  left_join(relevant_permits,
            by = c(parcel_number = "opa_account_num")) %>% 
  mutate(permit_2014_2018 = 
           case_when(str_detect(permit_years, "2014|2015|2016|2017|2018") ~ TRUE,
                     .default = FALSE)) %>% 
  mutate(permit_2019_2024 = 
           case_when(str_detect(permit_years, "2019|2020|2021|2022|2023|2024") ~ TRUE,
                     .default = FALSE))

# tidylog output for the left_join above:
# > rows only in x   540,662
# > rows only in y  (  1,293)
# > matched rows      43,336
# >                 =========
# > rows total       583,998
phila_boundary <- counties(state = "PA", progress_bar = FALSE) %>% 
  filter(NAMELSAD == "Philadelphia County") %>% 
  st_transform("EPSG:6565")

hexcells <- st_make_grid(phila_boundary, cellsize = 1320, square = FALSE) %>%
  .[phila_boundary] %>%           
  st_sf() %>%
  mutate(cell_id = 1:n())

# mapview::mapview(hexcells)

Modeling approach

Strategy and specifications

Our modeling strategy tackles the problem in two steps. First, we predict the probability of an individual parcel being issued permits for new construction or major alterations, through a logistic regression model applied to every parcel in the City of Philadelphia. Then, we aggregate these probabilities for every parcel within a cell in a hexagonal grid of Philadelphia, to yield a final small area-level estimate for the count of properties with development activity.

This strategy allows us to leverage the rich property-level data available from Philadelphia’s Office of Property Assessments and Department of Licenses and Inspections, as well as property-level predictors relating to amenities and disamenities. These property-level variables, intended to capture the attractiveness of a particular location for development, are:

  • Total area of parcel
  • Zoning category
  • Market value
  • OPA parcel category
  • Neighborhood
  • School catchment
  • Distance to nearest park
  • Distance to nearest SEPTA metro and regional rail station
  • Distance to nearest crime locations (separately for murder, violent crimes, property crimes, and quality-of-life crimes)

In addition to these property-level variables, our logistic regression model also includes predictors at the small-area level, which are:

  • Census tract median income
  • Census tract homeowner percentage
  • Census tract White population percentage
  • Properties with permits in the grid cell of the property
  • Properties with permits in the grid cell of the property 2014-2018
  • Average properties with permits in cells surrounding the grid cell of the property
  • Average properties with permits in cells surrounding the grid cell of the property, 2014-2018

The last four cell-based predictors provide additional spatial and spatiotemporal information, allowing development activity in nearby areas and in previous years to influence the prediction.

The logistic regression model was trained to predict the probability of a parcel being issued a development permit in the period 2019-2024. After cross-validating the trained model (see below), we sum the property-level probabilities over all properties in a grid cell and round to the nearest integer to produce the final cell-level prediction communicated to the end-user.

The aggregation allows us to improve model accuracy, especially as the rate of development permitting is low (at around 4% of properties in our dataset). Even if the prediction at the property level is quite uncertain, aggregating over many such properties will yield more statistically stable estimates. Further, information at the small-area level will be easier to effectively communicate to the end-user, and be more relevant for users who are interested in area-level changes rather than changes to an individual property.
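The aggregation step can be sketched as follows. This is a minimal illustration with made-up parcel-level probabilities; `parcel_preds` and its values are hypothetical, not our actual model output:

```r
library(dplyr)

# Toy parcel-level predictions: each parcel has a fitted permit probability
# and belongs to one hex cell (values made up for illustration)
parcel_preds <- tibble(
  parcel_id   = 1:6,
  cell_id     = c(1, 1, 1, 2, 2, 3),
  parcel_prob = c(0.10, 0.40, 0.65, 0.05, 0.02, 0.90)
)

# Sum the probabilities within each cell, then round to the nearest
# integer to get the expected count of permitted properties per cell
cell_preds <- parcel_preds %>%
  summarize(.by = cell_id,
            predicted_permits = round(sum(parcel_prob)))

cell_preds
# cell 1: 1.15 -> 1; cell 2: 0.07 -> 0; cell 3: 0.90 -> 1
```

Even though each individual parcel probability is uncertain, the cell-level sum is an expected count, which is what the end-user sees on the map.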

# For all selected permits 2014-2018
permits_net_2014_2018 <- opa_permits %>% 
  filter(permit_2014_2018 == TRUE) %>% 
  mutate(count_permits_2014_2018 = 1) %>% 
  select(count_permits_2014_2018) %>% 
  aggregate(hexcells, sum) %>%
  mutate(count_permits_2014_2018 = replace_na(count_permits_2014_2018, 0),
         cell_id = 1:n())

# For all selected permits 2019-2024
permits_net_2019_2024 <- opa_permits %>% 
  filter(permit_2019_2024 == TRUE) %>% 
  mutate(count_permits_2019_2024 = 1) %>% 
  select(count_permits_2019_2024) %>% 
  aggregate(hexcells, sum) %>%
  mutate(count_permits_2019_2024 = replace_na(count_permits_2019_2024, 0),
         cell_id = 1:n())
# Neighbor cell IDs
cell_neighbor_ids <- spdep::poly2nb(permits_net_2019_2024)

# Get neighbor permit means for particular cell
get_hexcell_neighbor_mean <- function(data, id) {
  
  neighbors <- cell_neighbor_ids[[id]]
  
  mean <- data %>% 
    st_drop_geometry() %>% 
    filter(cell_id %in% neighbors) %>% 
    select(contains("count")) %>% 
    pull(1) %>% 
    mean(na.rm = TRUE)
  
  tibble(cell_id = id, neighbor_mean = mean)
}

# Get neighbor permit means for all cells
permits_neighbors_2019_2024 <- permits_net_2019_2024$cell_id %>% 
  map_dfr(~get_hexcell_neighbor_mean(permits_net_2019_2024, .))
  
permits_neighbors_2014_2018 <- permits_net_2014_2018$cell_id %>% 
  map_dfr(~get_hexcell_neighbor_mean(permits_net_2014_2018, .))

# Join neighbor means to original net
permits_net_with_neighbors_2019_2024 <- permits_net_2019_2024 %>% 
  left_join(permits_neighbors_2019_2024) %>% 
  rename(cell_permits_2019_2024 = count_permits_2019_2024,
         cell_permits_neighbor_mean_2019_2024 = neighbor_mean)

permits_net_with_neighbors_2014_2018 <- permits_net_2014_2018 %>% 
  left_join(permits_neighbors_2014_2018) %>% 
  rename(cell_permits_2014_2018 = count_permits_2014_2018,
         cell_permits_neighbor_mean_2014_2018 = neighbor_mean)
opa_with_cellcounts <- opa_permits %>% 
  # Exclude rows with no geospatial information
  # filter: removed 214 rows (<1%), 583,821 rows remaining
  filter(!st_is_empty(.)) %>% 
  st_join(hexcells) %>% 
  left_join(permits_net_with_neighbors_2019_2024 %>% 
              st_drop_geometry(),
            by = "cell_id") %>% 
  left_join(permits_net_with_neighbors_2014_2018 %>% 
              st_drop_geometry(),
            by = "cell_id") %>% 
  select(parcel_number, cell_id, total_area, zoning, market_value, category_code, contains("permit"), -permit_years)
  
# Minimal dataframe for spatial operations only
opa_geometry <- opa_with_cellcounts %>% 
  select(parcel_number) 
# Shapefile
tracts_initial <- 
  tracts(state = "PA", county = "Philadelphia", year = 2022) %>% 
  # Transform CRS
  st_transform(crs = "EPSG:6565") 

# Initial data
demographics_tract_initial <- 
  get_acs(geography = "tract",
          variables = 
            c(# Median household income
              "B19013_001", 
              # Tenure: total and owner
              "B25003_001", "B25003_002", 
              # Race/Ethnicity: total and non-hispanic white
              "B03002_001", "B03002_003"),
          year = 2022,
          output = "wide",
          state = "PA",
          county = "Philadelphia",
          geometry = TRUE,
          survey = "acs5") 

# Derivative variables calculated and selected
demographics_tract_processed <- demographics_tract_initial %>% 
  select(geoid_tract = GEOID, matches("E$"), -NAME) %>% 
  rename(median_income = B19013_001E,
         tenure_total = B25003_001E,
         tenure_owner = B25003_002E,
         race_total = B03002_001E,
         race_white_only = B03002_003E) %>% 
  mutate(owner_percent = tenure_owner / tenure_total) %>% 
  mutate(white_only_percent = race_white_only / race_total) %>% 
  mutate(poc_percent = 1 - white_only_percent) %>% 
  select(geoid_tract, 
         median_income_tract = median_income,
         owner_percent_tract = owner_percent,
         white_only_percent_tract = white_only_percent,
         poc_percent_tract = poc_percent) %>% 
  mutate(across(where(is.numeric), ~if_else(is.finite(.), ., NA)))

# Imputing missing values based on mean of values from neighboring geographies:
# For each variable, filter to NAs, join values from neighboring geographies, take the mean
impute_acs <- function(data) {
  
  impute_variable <- function(data, geoid_col, var) {
    
    data_imputed <- data %>% 
      filter(is.na(.data[[var]])) %>% 
      select(all_of(geoid_col)) %>% 
      st_join(data %>% 
                select(all_of(var)),
              join = st_touches) %>% 
      filter(!is.na(.data[[var]])) %>% 
      st_drop_geometry() %>% 
      summarize(.by = all_of(geoid_col),
                "{var}" := mean(.data[[var]]))
    
    data_output <- data %>%
      select(all_of(c(geoid_col, var))) %>%
      left_join(data_imputed, by = geoid_col) %>%
      mutate("{var}" := coalesce(.data[[str_c(var, ".x")]], .data[[str_c(var, ".y")]])) %>%
      select(all_of(c(geoid_col, var))) %>% 
      st_drop_geometry()
    
  }
  
  names <- names(data)
  
  geoid_col <- names[str_detect(names, "geoid")]
  
  vars <- data %>% 
    st_drop_geometry() %>% 
    select(where(~ any(is.na(.)))) %>% 
    names()
  
  output <- vars %>% 
    map(~ impute_variable(data, geoid_col, .)) %>% 
    reduce(left_join, by = geoid_col)
  
}

# Imputing missing values based on mean of values from neighboring geographies
demographics_tract_processed_imputed <- impute_acs(demographics_tract_processed)
neighborhoods_initial <-
  st_read("https://github.com/opendataphilly/open-geo-data/raw/refs/heads/master/philadelphia-neighborhoods/philadelphia-neighborhoods.geojson") %>% 
  clean_names() %>% 
  # Transform CRS
  st_transform(crs = "EPSG:6565") 

park_boundaries <-   
  st_read("https://opendata.arcgis.com/datasets/d52445160ab14380a673e5849203eb64_0.geojson") %>% 
  clean_names() %>% 
  # Transform CRS
  st_transform(crs = "EPSG:6565") 

park_distances_indices <- opa_geometry %>% 
  st_nearest_feature(park_boundaries)

park_distances_vector <- 
  st_distance(opa_geometry$geometry, 
              park_boundaries$geometry[park_distances_indices], 
              by_element = TRUE)

park_distances <- opa_geometry %>% 
  mutate(nearest_id = park_distances_indices) %>% 
  mutate(nearest_name = park_boundaries$public_name[nearest_id]) %>% 
  mutate(distance_to_nearest_park = as.numeric(park_distances_vector))

school_catchments <- 
  st_read("https://services.arcgis.com/fLeGjb7u4uXqeF9q/arcgis/rest/services/SchoolDist_Catchments_ES/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson") %>% 
  clean_names() %>% 
  # Transform CRS
  st_transform(crs = "EPSG:6565") 

septa_subway <- 
  st_read("https://opendata.arcgis.com/api/v3/datasets/af52d74b872045d0abb4a6bbbb249453_0/downloads/data?format=geojson&spatialRefId=4326") %>% 
  filter(Route != "Norristown Highspeed Line") %>% 
  rename(station_name = Station,
         line = Route) %>% 
  # Transform CRS
  st_transform(crs = "EPSG:6565")

septa_subway_distances_indices <- opa_geometry %>% 
  st_nearest_feature(septa_subway)

septa_subway_distances_vector <- 
  st_distance(opa_geometry$geometry, 
              septa_subway$geometry[septa_subway_distances_indices], 
              by_element = TRUE)

septa_subway_distances <- opa_geometry %>% 
  mutate(nearest_id = septa_subway_distances_indices) %>% 
  mutate(nearest_name = septa_subway$station_name[nearest_id]) %>% 
  mutate(distance_to_nearest_septa_subway = as.numeric(septa_subway_distances_vector))

septa_regional_rail <- 
  st_read("https://opendata.arcgis.com/api/v3/datasets/cf49c62fbc17430aa818e900556d207e_0/downloads/data?format=geojson&spatialRefId=4326") %>% 
  rename(station_name = Station_Na,
         line = Line_Name) %>% 
  # Transform CRS
  st_transform(crs = "EPSG:6565") 

septa_regional_rail_distances_indices <- opa_geometry %>% 
  st_nearest_feature(septa_regional_rail)

septa_regional_rail_distances_vector <- 
  st_distance(opa_geometry$geometry, 
              septa_regional_rail$geometry[septa_regional_rail_distances_indices], 
              by_element = TRUE)

septa_regional_rail_distances <- opa_geometry %>% 
  mutate(nearest_id = septa_regional_rail_distances_indices) %>% 
  mutate(nearest_name = septa_regional_rail$station_name[nearest_id]) %>% 
  mutate(distance_to_nearest_septa_regional_rail = as.numeric(septa_regional_rail_distances_vector))

septa_trolley <- 
  st_read("https://opendata.arcgis.com/api/v3/datasets/dd2afb618d804100867dfe0669383159_0/downloads/data?format=geojson&spatialRefId=4326") %>% 
  rename(station_name = StopName,
         line = LineAbbr) %>% 
  # Transform CRS
  st_transform(crs = "EPSG:6565") 

septa_trolley_distances_indices <- opa_geometry %>% 
  st_nearest_feature(septa_trolley)

septa_trolley_distances_vector <- 
  st_distance(opa_geometry$geometry, 
              septa_trolley$geometry[septa_trolley_distances_indices], 
              by_element = TRUE)

septa_trolley_distances <- opa_geometry %>% 
  mutate(nearest_id = septa_trolley_distances_indices) %>% 
  mutate(nearest_name = septa_trolley$station_name[nearest_id]) %>% 
  mutate(distance_to_nearest_septa_trolley = as.numeric(septa_trolley_distances_vector))

crime_initial <-
  read_csv("https://phl.carto.com/api/v2/sql?filename=incidents_part1_part2&format=csv&q=SELECT%20*%20,%20ST_Y(the_geom)%20AS%20lat,%20ST_X(the_geom)%20AS%20lng%20FROM%20incidents_part1_part2%20WHERE%20dispatch_date_time%20%3E=%20%272023-01-01%27%20AND%20dispatch_date_time%20%3C%20%272024-01-01%27") %>%
  clean_names() %>%
  dplyr::select(text_general_code,lat, lng) %>%
  na.omit() %>%
  st_as_sf(coords = c('lng', 'lat')) %>% 
  st_set_crs("EPSG:4326") %>%
  st_transform(crs = "EPSG:6565") %>%
  rename(description = text_general_code) %>% 
  mutate(crime_category =
           case_when(
             description %in%
               c("Homicide - Criminal") ~
               "murder",
             description %in%
               c("Aggravated Assault Firearm", "Aggravated Assault No Firearm", "Other Assaults",
                 "Homicide - Justifiable",
                 "Rape", "Other Sex Offenses (Not Commercialized)",
                 "Offenses Against Family and Children",
                 "Robbery Firearm", "Robbery No Firearm",
                 "Weapon Violations") ~
               "violent",
             description %in%
               c("Arson", "Burglary Non-Residential", "Burglary Residential",
                 "Motor Vehicle Theft", "Theft from Vehicle", "Thefts") ~
               "property",
             description %in%
               c("Disorderly Conduct", "Gambling Violations", "Liquor Law Violations",
                 "Narcotic / Drug Law Violations", "Prostitution and Commercialized Vice",
                 "Public Drunkenness", "Receiving Stolen Property", "Vagrancy/Loitering",
                 "Vandalism/Criminal Mischief") ~
               "quality_of_life",
             .default = "other"))

crime_murder <- crime_initial %>% 
  filter(crime_category == "murder")

crime_violent <- crime_initial %>% 
  filter(crime_category == "violent")

crime_property <- crime_initial %>% 
  filter(crime_category == "property")

crime_quality_of_life <- crime_initial %>% 
  filter(crime_category == "quality_of_life")

# Get a suite of KNN measures for each crime type
crime_distances <- opa_geometry %>% 
  mutate(crime_murder_nn_1 = 
           nn_function(st_coordinates(opa_geometry),
                       st_coordinates(crime_murder),
                       k = 1)) %>% 
  mutate(crime_violent_nn_1 = 
           nn_function(st_coordinates(opa_geometry),
                       st_coordinates(crime_violent),
                       k = 1)) %>% 
  mutate(crime_property_nn_1 = 
           nn_function(st_coordinates(opa_geometry),
                       st_coordinates(crime_property),
                       k = 1)) %>% 
  mutate(crime_quality_of_life_nn_1 = 
           nn_function(st_coordinates(opa_geometry),
                       st_coordinates(crime_quality_of_life),
                       k = 1))
data_ready <- opa_with_cellcounts %>% 
  st_join(neighborhoods_initial %>% 
            select(neighborhood = mapname)) %>% 
  st_join(tracts_initial %>% 
            select(geoid_tract = GEOID)) %>% 
  left_join(demographics_tract_processed_imputed, by = "geoid_tract") %>% 
  st_join(school_catchments %>% 
            select(school = es_name)) %>% 
  left_join(park_distances %>% 
              st_drop_geometry() %>% 
              select(parcel_number, distance_to_nearest_park)) %>% 
  left_join(septa_subway_distances %>% 
              st_drop_geometry() %>% 
              select(parcel_number, distance_to_nearest_septa_subway)) %>% 
  left_join(septa_regional_rail_distances %>% 
              st_drop_geometry() %>% 
              select(parcel_number, distance_to_nearest_septa_regional_rail)) %>% 
  left_join(septa_trolley_distances %>% 
              st_drop_geometry() %>% 
              select(parcel_number, distance_to_nearest_septa_trolley)) %>%
  left_join(crime_distances %>%
              st_drop_geometry())
# As noted at the top, we read in the prepared dataset here instead of re-running the full pipeline
data_ready <- read_rds(str_c(Sys.getenv("BOX_PATH"), "musa-508-final-data", "data_modelling.rds", sep = "/"))

# Smaller sample of data that can be plotted faster
data_plotting <- data_ready %>% 
  slice_sample(prop = 0.1)

Exploratory Analysis

The following maps explore the spatial distribution of our indicators. First, a map of the count of permitted properties in each of our quarter-mile hex cells for the first five years of data is shown next to the mean count of permitted properties in each cell's neighbors.

grid.arrange(hexcells %>% 
  left_join(data_ready %>% st_drop_geometry() %>% select(cell_permits_2014_2018, cell_id)) %>%
  group_by(cell_permits_2014_2018, cell_id) %>% summarize() %>% 
  ggplot() + 
  geom_sf(aes(fill=cell_permits_2014_2018), color=NA) +
  scale_fill_gradient(name = "Permitted properties in cell",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title = "Permitted Properties, 2014-2018")+
  theme_void(),
  
  hexcells %>% 
  left_join(data_ready %>% st_drop_geometry() %>% select(cell_permits_neighbor_mean_2014_2018, cell_id)) %>%
  group_by(cell_permits_neighbor_mean_2014_2018, cell_id) %>% summarize() %>% 
  ggplot() + 
  geom_sf(aes(fill=cell_permits_neighbor_mean_2014_2018), color=NA) +
  scale_fill_gradient(name = "Mean permitted properties in neighbors",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title = "Mean Number of Permitted Properties for Neighbor Cells, 2014-2018")+
  theme_void())

This shows a concentration of permitted properties and their neighbors in areas near Center City, with other parts of the city having fewer permitted properties. Areas with a high density of permitted properties also tend to be near other high-development areas.

The spatial distribution of more recent permits is largely similar, though not identical.

grid.arrange(hexcells %>% 
  left_join(data_ready %>% st_drop_geometry() %>% select(cell_permits_2019_2024, cell_id)) %>%
  group_by(cell_permits_2019_2024, cell_id) %>% summarize() %>% 
  ggplot() + 
  geom_sf(aes(fill=cell_permits_2019_2024), color=NA) +
  scale_fill_gradient(name = "Permitted properties in cell",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title = "Permitted Properties, 2019-2024")+
  theme_void(),
  
  hexcells %>% 
  left_join(data_ready %>% st_drop_geometry() %>% select(cell_permits_neighbor_mean_2019_2024, cell_id)) %>%
  group_by(cell_permits_neighbor_mean_2019_2024, cell_id) %>% summarize() %>% 
  ggplot() + 
  geom_sf(aes(fill=cell_permits_neighbor_mean_2019_2024), color=NA) +
  scale_fill_gradient(name = "Mean permitted properties in neighbors",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title = "Mean Number of Permitted Properties for Neighbor Cells, 2019-2024")+
  theme_void())

Another predictor is distance to transit, specifically trolley routes, regional rail, and heavy rail (the Market-Frankford and Broad Street Lines), since proximity to transit may motivate certain types of development depending on the area. Our maps show that many properties in Philadelphia are located relatively close to transit, with the outer parts of the city at greater distances. The areas with high permit counts have some of the shortest distances to transit.

grid.arrange(top="Distance to Transit",nrow=2,
  ggplot(data_plotting)+
  geom_sf(aes(color=distance_to_nearest_septa_subway))+
  scale_color_gradient(name = "Distance (feet)",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Heavy Rail")+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(aes(color=distance_to_nearest_septa_trolley))+
  scale_color_gradient(name = "Distance (feet)",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Trolley Routes")+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(aes(color=distance_to_nearest_septa_regional_rail))+
  scale_color_gradient(name = "Distance (feet)",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Regional Rail")+
  theme_void())

grid.arrange(top="Distance to Transit (FALSE indicates no permit, TRUE indicates permit)",nrow=2,
  ggplot(data_plotting)+
  geom_sf(data=phila_boundary, fill="white")+
  geom_sf(aes(color=distance_to_nearest_septa_subway))+
  scale_color_gradient(name = "Distance (feet)",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Heavy Rail")+
  facet_wrap(~permit_2014_2018)+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(data=phila_boundary, fill="white")+
  geom_sf(aes(color=distance_to_nearest_septa_trolley))+
  scale_color_gradient(name = "Distance (feet)",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Trolley Routes")+
  facet_wrap(~permit_2014_2018)+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(data=phila_boundary, fill="white")+
  geom_sf(aes(color=distance_to_nearest_septa_regional_rail))+
  scale_color_gradient(name = "Distance (feet)",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Regional Rail")+
  facet_wrap(~permit_2014_2018)+
  theme_void())

When separating properties that received permits from those that did not, the pattern is very similar, since the transit routes are fixed and permitted properties are spread throughout the city.

grid.arrange(top="Parcel distance from nearest crime incidents", nrow=2,
  ggplot(data_plotting)+
  geom_sf(aes(color=crime_murder_nn_1), size=0.3)+
  scale_color_gradient(name = "Distance (feet)",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Murder")+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(aes(color=crime_violent_nn_1), size=0.3)+
  scale_color_gradient(name = "Distance (feet)",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Violent crime")+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(aes(color=crime_property_nn_1), size=0.3)+
  scale_color_gradient(name = "Distance (feet)",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Property-related crime")+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(aes(color=crime_quality_of_life_nn_1), size=0.3)+
  scale_color_gradient(name = "Distance (feet)",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Quality-of-life crime")+
  theme_void())

We grouped the city's crime data into four categories: murder, violent crime, property-related crime, and quality-of-life crime. Mapped above is each parcel's distance to the nearest crime in each category. Parcels are located relatively close to crime regardless of category, with no visibly significant spatial patterns aside from greater distances in the outer parts of the city.

grid.arrange(top="Parcel distance from nearest crime incidents by development pattern",
  ggplot(data_plotting)+
  geom_sf(data=phila_boundary, fill="white")+
  geom_sf(aes(color=crime_murder_nn_1), size=0.3)+
  scale_color_gradient(name = "Distance (feet)",low="#fdb863", high="#5e3c99", na.value = "white")+
  facet_wrap(~permit_2014_2018)+
  labs(title="Murder", subtitle="FALSE indicates no permit, TRUE indicates permit")+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(data=phila_boundary, fill="white")+
  geom_sf(aes(color=crime_violent_nn_1), size=0.3)+
  scale_color_gradient(name = "Distance (feet)",low="#fdb863", high="#5e3c99", na.value = "white")+
  facet_wrap(~permit_2014_2018)+
  labs(title="Violent Crime")+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(data=phila_boundary, fill="white")+
  geom_sf(aes(color=crime_property_nn_1), size=0.3)+
  scale_color_gradient(name = "Distance (feet)",low="#fdb863", high="#5e3c99", na.value = "white")+
  facet_wrap(~permit_2014_2018)+
  labs(title="Property-related crime")+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(data=phila_boundary, fill="white")+
  geom_sf(aes(color=crime_quality_of_life_nn_1), size=0.3)+
  scale_color_gradient(name = "Distance (feet)",low="#fdb863", high="#5e3c99", na.value = "white")+
  facet_wrap(~permit_2014_2018)+
  labs(title="Quality-of-life Crime")+
  theme_void())

As with the permit/no permit split for transit, crime incidents occur near properties throughout the city, and the spatial patterns are not visually significant.

grid.arrange(ggplot(data_plotting)+
  geom_sf(aes(color=median_income_tract), size=0.3)+
  scale_color_gradient(name = "Income",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Median Income")+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(aes(color=owner_percent_tract), size=0.3)+
  scale_color_gradient(name = "Percent",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Percentage of households who own home")+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(aes(color=white_only_percent_tract), size=0.3)+
  scale_color_gradient(name = "Percent",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Percent white population")+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(aes(color=poc_percent_tract), size=0.3)+
  scale_color_gradient(name = "Percent",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Percent non-white population")+
  theme_void())

Demographic patterns show that race tends to cluster in space: areas with a high white population tend to be near other such areas, and the same holds for areas with a high non-white population. Income is similarly clustered, though less so, with few areas of the city having a high median income. The percentage of households owning their home varies across the city, with lower ownership rates near Center City.

grid.arrange(ggplot(data_plotting)+
  geom_sf(data=phila_boundary, fill="white")+
  geom_sf(aes(color=median_income_tract), size=0.3)+
  scale_color_gradient(name = "Income",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Median Income")+
  facet_wrap(~permit_2014_2018)+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(data=phila_boundary, fill="white")+
  geom_sf(aes(color=owner_percent_tract), size=0.3)+
  scale_color_gradient(name = "Percent",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Percentage of households who own home")+
  facet_wrap(~permit_2014_2018)+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(data=phila_boundary, fill="white")+
  geom_sf(aes(color=white_only_percent_tract), size=0.3)+
  scale_color_gradient(name = "Percent",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Percent white population")+
  facet_wrap(~permit_2014_2018)+
  theme_void(),
  
  ggplot(data_plotting)+
  geom_sf(data=phila_boundary, fill="white")+
  geom_sf(aes(color=poc_percent_tract), size=0.3)+
  scale_color_gradient(name = "Percent",low="#fdb863", high="#5e3c99", na.value = "white")+
  labs(title="Percent non-white population")+
  facet_wrap(~permit_2014_2018)+
  theme_void())

Unlike the two permit/no-permit split maps above, the demographic patterns do differ among the development types: permitted properties are more often located in areas with higher homeownership rates and higher incomes.

data_ready %>% 
  st_drop_geometry() %>%
  group_by(neighborhood) %>% 
  filter(permit_2014_2018==TRUE) %>% 
  summarize(Permits = n()) %>% slice_max(Permits, n = 10) %>%
  ggplot(aes(x = fct_reorder(neighborhood, Permits), y = Permits)) +
  geom_bar(stat="identity", fill="#b2abd2")+
  coord_flip()+
  labs(title = "Neighborhoods with the most permitted properties, 2014-2018", 
       subtitle="Permit counts based on the number of parcels in each neighborhood that received at least one permit", 
       x=NULL,
       y = "Permitted properties")+
  theme_minimal()+
  theme(plot.title = element_text(size=15),
        plot.subtitle = element_text(face="italic",size=8))

The chart above names the neighborhoods with the most permitted properties in the first five years of our data. Many of these neighborhoods, like Lower Kensington, are located in or adjacent to Center City, with Point Breeze seeing the most development in this period.

Modelling and validation

Our full dataset was divided into training and testing sets, with the training set comprising 25% of the full data and the testing set the remaining 75% (the training share is small because the dataset is so large). Assignment was random but stratified on the outcome, so both sets contain the same percentage of observations with permits.

# # Prepare data for modeling
# data_modelling <- data_ready %>% 
#   # Remove observations with any NAs
#   na.omit() %>% 
#   # Factorize all logicals
#   mutate(across(where(is.logical), ~ fct_relevel(as.factor(.), "FALSE", "TRUE")))

data_modelling <- read_rds(str_c(Sys.getenv("BOX_PATH"), "musa-508-final-data", "data_modelling.rds", sep = "/"))

# Version with no geometry, since modeling doesn't use geometry
data_modelling_no_geometry <- data_modelling %>% 
  st_drop_geometry()

# Defining testing and training sets (note training portion is only 25%, since dataset is so large)
set.seed(2718)
modelling_split <- initial_split(data_modelling_no_geometry, prop = 0.25, strata = "permit_2019_2024")

modelling_train <- training(modelling_split)

modelling_test <- testing(modelling_split)
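Because the split is stratified on the outcome, the permit rate should be nearly identical in both sets; a quick check along these lines would confirm it:

```r
# Sketch: confirm the stratified split preserved the outcome balance
bind_rows(train = modelling_train, test = modelling_test, .id = "set") %>% 
  count(set, permit_2019_2024) %>% 
  group_by(set) %>% 
  mutate(prop = n / sum(n)) %>% 
  ungroup()
```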
# Using logistic regression as our model type
model_engine <- logistic_reg() %>% set_engine("glm")

# Model recipe
recipe_full <- recipe(modelling_train) %>% 
  # Outcome
  update_role(permit_2019_2024, new_role = "outcome") %>% 
  # Predictors
  update_role(
    # OPA variables
    total_area, zoning, market_value, category_code, 
    # Spatial lag
    cell_permits_2019_2024, cell_permits_neighbor_mean_2019_2024, cell_permits_neighbor_mean_2014_2018,
    # Temporal lag
    cell_permits_2014_2018,
    # Other variables
    neighborhood, 
    median_income_tract, owner_percent_tract, white_only_percent_tract, 
    school, 
    distance_to_nearest_park, distance_to_nearest_septa_subway, 
    distance_to_nearest_septa_regional_rail, distance_to_nearest_septa_trolley,
    crime_murder_nn_1, crime_violent_nn_1, crime_property_nn_1, crime_quality_of_life_nn_1,
    new_role = "predictor") %>% 
  # Handling factor levels that could be unseen in new data
  step_other(zoning) %>% 
  # Numeric transformations
  step_log(market_value, contains("distance"), contains("_nn_"),
           offset = 1) %>% 
  step_dummy(all_nominal_predictors()) 
# Defining tidymodels workflow
model_workflow <- workflow() %>%
  add_model(model_engine) %>% 
  add_recipe(recipe_full)

# Fitted model
training_fit_workflow <- fit(model_workflow, modelling_train)

# Model fit object
training_fit_only <- extract_fit_parsnip(training_fit_workflow)

# # Write out
# write_rds(training_fit_workflow, str_c(Sys.getenv("BOX_PATH"), "musa-508-final-data", "training_workflow.rds", sep = "/"))

# # Write out
# write_rds(training_fit_only, str_c(Sys.getenv("BOX_PATH"), "musa-508-final-data", "training_fit_only.rds", sep = "/"))
# Setting up cross-validation folds
set.seed(2718)
data_folds <- vfold_cv(modelling_train, v = 10, strata = "permit_2019_2024")

# Specify metrics with correct reference level 
sensitivity_fixed <- metric_tweak("sensitivity", sensitivity, event_level = "second")
specificity_fixed <- metric_tweak("specificity", specificity, event_level = "second")

# Run validation resamples (takes ~ 0.5 hour)
set.seed(2718)
resamples <- fit_resamples(model_workflow,
                          data_folds,
                          metrics = metric_set(sensitivity_fixed, specificity_fixed, roc_auc),
                          control = control_resamples(save_pred = TRUE))
# Read
training_fit_workflow <- read_rds(str_c(Sys.getenv("BOX_PATH"), "musa-508-final-data", "training_workflow.rds", sep = "/"))
training_fit_only <- read_rds(str_c(Sys.getenv("BOX_PATH"), "musa-508-final-data", "training_fit_only.rds", sep = "/"))
resamples <- read_rds(str_c(Sys.getenv("BOX_PATH"), "musa-508-final-data", "training_resamples.rds", sep = "/"))
# Validation metrics
resample_metrics <- collect_metrics(resamples)

# Cross validation predictions
resample_predictions <- resamples$.predictions %>% 
  list_rbind(names_to = "fold")

# Aggregate predictions in fishnet
resample_predictions_joined <- resample_predictions %>% 
  select(fold, row_id = .row, permit_2019_2024, permit_prob = .pred_TRUE) %>% 
  left_join(modelling_train %>% 
              mutate(row_id = row_number()) %>% 
              select(row_id, cell_id),
            by = "row_id")

# Compare to actual counts
resample_predictions_fishnet_aggregate <- resample_predictions_joined %>% 
  group_by(fold, cell_id) %>% 
  summarize(actual_permits = sum(as.numeric(permit_2019_2024) - 1),
            predicted_permits = round(sum(permit_prob, na.rm = TRUE))) %>% 
  ungroup() %>% 
  mutate(absolute_error = abs(actual_permits - predicted_permits))

# Summary across resampling folds
resample_predictions_fishnet_aggregate_summary <- resample_predictions_fishnet_aggregate %>% 
  summarize(.by = fold, mean_absolute_error = mean(absolute_error))

# MAE across training folds
# 0.2444205
# mean(resample_predictions_fishnet_aggregate$absolute_error)

After model training, we performed 10-fold cross-validation on the training data to evaluate the initial performance of the logistic regression model. The mean area under the ROC curve (AUC) across the validation folds was 0.776, a moderately high value. After aggregating the predicted permit probabilities into expected counts per cell, we observed a mean absolute error of 0.244 in the count of parcels with development permits, meaning that on average the predicted count of permitted properties in a cell differed from the actual count by less than one.

The table below shows MAE values for each cross-validation fold.

# Display table
resample_predictions_fishnet_aggregate_summary %>% 
  gt_print() %>% 
  fmt_number(mean_absolute_error, decimals = 3) %>% 
  tab_header("Mean absolute error of aggregated permitted property counts by cell")
Mean absolute error of aggregated permitted property counts by cell

Fold    Mean absolute error
1       0.233
2       0.259
3       0.228
4       0.240
5       0.246
6       0.251
7       0.249
8       0.264
9       0.239
10      0.235
# Generate predictions
test_predictions <- predict(training_fit_workflow, modelling_test, type = "prob") %>% 
  bind_cols(modelling_test) %>% 
  rename(permit_probability = .pred_TRUE)

# Validation metrics
# ROC AUC: 0.7796151
test_validation <- 
  roc_auc_vec(test_predictions$permit_2019_2024, test_predictions$permit_probability, 
              event_level = "second")

# Aggregate predictions and errors
test_predictions_fishnet_aggregate <- test_predictions %>% 
  group_by(cell_id) %>% 
  summarize(actual_permits = sum(as.numeric(permit_2019_2024) - 1),
            predicted_permits = round(sum(permit_probability))) %>% 
  ungroup() %>% 
  mutate(error = actual_permits - predicted_permits) %>% 
  mutate(absolute_error = abs(error))

# MAE at fishnet level
# 2.107957
# mean(test_predictions_fishnet_aggregate$absolute_error)

The model’s performance on the testing data was similarly good, with an area under the ROC curve of 0.780. The mean absolute error between actual and predicted permitted property counts at the cell level was 2.11, higher than in cross-validation because the testing set was about 30 times larger, so both actual and predicted counts per cell were larger in magnitude.
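One way to put the cross-validation and test MAE figures on a comparable footing is to scale each cell's error by the number of parcels contributing to it. A sketch, assuming each row of `test_predictions` is one parcel (as in the aggregation above):

```r
# Sketch: per-parcel absolute error, comparable across differently sized sets
test_error_per_parcel <- test_predictions %>% 
  group_by(cell_id) %>% 
  summarize(n_parcels = n(),
            actual_permits = sum(as.numeric(permit_2019_2024) - 1),
            predicted_permits = round(sum(permit_probability))) %>% 
  ungroup() %>% 
  mutate(abs_error_per_parcel = abs(actual_permits - predicted_permits) / n_parcels)

mean(test_error_per_parcel$abs_error_per_parcel)
```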

The maps below enable comparisons of aggregated model predictions across the city, showing first the predicted values, then the actual values. The spatial pattern of predicted permitted properties is very similar to the actual data.

# Map of predictions and errors
test_predictions_map_data <- hexcells %>% 
  inner_join(test_predictions_fishnet_aggregate, by = "cell_id")

#
mapview(test_predictions_map_data, zcol = "predicted_permits", at = c(0, 3, 10, 50, 100, 150, 300),
        popup=popupTable(test_predictions_map_data, zcol = "predicted_permits",feature.id = FALSE),legend=TRUE, 
        layer.name = "Predicted Permitted Properties")
mapview(test_predictions_map_data, zcol = "actual_permits", at = c(0, 3, 10, 50, 100, 150, 300),legend=TRUE, 
        popup=popupTable(test_predictions_map_data, zcol = "actual_permits",feature.id = FALSE),
        layer.name = "Actual Permitted Properties")

Looking more closely at the patterns of error, we see that errors are higher in cells with more development activity. In part, this is due to our final outcome measure being counts; the larger the count, the more room for error.
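This relationship between activity and error can be checked directly, for example with a scatter of absolute error against actual permitted property counts per cell:

```r
# Sketch: does absolute error grow with development activity per cell?
test_predictions_fishnet_aggregate %>% 
  ggplot(aes(x = actual_permits, y = absolute_error)) +
  geom_point(alpha = 0.3, color = "#5e3c99") +
  geom_smooth(method = "lm", se = FALSE, color = "#e66101") +
  labs(title = "Absolute error vs. development activity by cell",
       x = "Actual permitted properties in cell",
       y = "Absolute error")
```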

mapview(test_predictions_map_data,zcol = "error", 
        at = c(-50, -25, -5, 5, 25, 50), 
        col.regions = list("#e66101", "#fdb863", "#f7f7f7", "#b2abd2", "#5e3c99"),
        legend=TRUE, layer.name="Error")

This said, we observe that overpredictions are about as likely as underpredictions, and that the spatial pattern of errors does not show a clear aggregate pattern for over- vs. underpredictions.

test_predictions_fishnet_aggregate %>% 
  ggplot(aes(x = error)) +
  geom_histogram() +
  labs(title = "Histogram of prediction error size by cell",
       subtitle = "Negative values are overpredictions, positive values underpredictions",
       x = "Error",
       y = "Number of cells")

Looking more closely at the generalizability of our predictions, we can break down model errors by the demographics of each cell (each cell is assigned the tract-level demographic status shared by the majority of its properties).

The plot below shows that our model produced excellent predictions (absolute error of 2 or less per cell) for the vast majority of cells regardless of cell demographics, though larger errors were more likely for lower-income cells, majority-POC cells, and majority-renter cells. Even where errors were larger, however, overpredictions were nearly as likely as underpredictions across demographic characteristics.

test_demographics <- modelling_test %>% 
  mutate(income = 
           if_else(median_income_tract < median(modelling_test$median_income_tract),
                   "Below median income", "Above median income")) %>% 
  mutate(race =
           if_else(poc_percent_tract > 0.5, "Majority POC", "Not majority POC")) %>% 
  mutate(tenure =
           if_else(owner_percent_tract < 0.5, "Majority renter", "Not majority renter"))

test_income_summary <- test_demographics %>% 
  group_by(cell_id, income) %>% 
  summarize(n = n()) %>% 
  slice_max(order_by = n, with_ties = FALSE) %>% 
  select(-n)

test_race_summary <- test_demographics %>% 
  group_by(cell_id, race) %>% 
  summarize(n = n()) %>% 
  slice_max(order_by = n, with_ties = FALSE) %>% 
  select(-n)

test_tenure_summary <- test_demographics %>% 
  group_by(cell_id, tenure) %>% 
  summarize(n = n()) %>% 
  slice_max(order_by = n, with_ties = FALSE) %>% 
  select(-n)

test_generalizability <- test_predictions_fishnet_aggregate %>% 
  mutate(cell_status = case_when(error < -2 ~ "Overprediction",
                                 error >= -2 & error <= 2 ~ "Close prediction",
                                 error > 2 ~ "Underprediction")) %>% 
  mutate(cell_status = 
           fct_relevel(cell_status, "Overprediction", "Close prediction", "Underprediction")) %>% 
  left_join(test_income_summary) %>% 
  left_join(test_race_summary) %>% 
  left_join(test_tenure_summary) %>% 
  pivot_longer(c(income, race, tenure), names_to = "variable", values_to = "category") %>% 
  mutate(variable = str_to_sentence(variable))

test_generalizability %>% 
  group_by(variable, category, cell_status) %>% 
  summarize(n = n()) %>% 
  mutate(percent = n / sum(n)) %>% 
  ggplot(aes(x = category, y = percent, fill = cell_status)) +
  geom_col(position = position_dodge2()) +
  geom_text(aes(label = label_percent(accuracy = 1)(percent)), 
            position = position_dodge2(width = 0.8),
            vjust = -0.5) +
  scale_y_continuous(label = label_percent()) +
  scale_fill_manual(values = c("#e66101", "gray", "#5e3c99")) +
  facet_wrap(vars(variable), scales = "free_x") +
  labs(title = "Prediction accuracy by cell demographics",
       x = NULL,
       y = "Percent of cells",
       fill = "Prediction accuracy",
       caption = "Close prediction means absolute error of 2 or less in cell") +
  theme(legend.position = "bottom")

Final model predictions

# Generate predictions
overall_predictions <- predict(training_fit_workflow, data_modelling_no_geometry, type = "prob") %>% 
  bind_cols(data_modelling_no_geometry) %>% 
  rename(permit_probability = .pred_TRUE)

# Validation metrics
# ROC AUC: 0.783222
overall_validation <- 
  roc_auc_vec(overall_predictions$permit_2019_2024, overall_predictions$permit_probability, 
              event_level = "second")

# Aggregate predictions and errors
overall_predictions_fishnet_aggregate <- overall_predictions %>% 
  group_by(cell_id) %>% 
  summarize(actual_permits = sum(as.numeric(permit_2019_2024) - 1),
            predicted_permits = round(sum(permit_probability))) %>% 
  ungroup() %>% 
  mutate(error = actual_permits - predicted_permits) %>% 
  mutate(absolute_error = abs(error))

# MAE at fishnet level
# 2.414871
# mean(overall_predictions_fishnet_aggregate$absolute_error)

# Map of predictions and errors
overall_predictions_map_data <- hexcells %>% 
  inner_join(overall_predictions_fishnet_aggregate, by = "cell_id")

Below, we map the final modelled predictions for the entire dataset: the predicted number of properties with a development permit per grid cell…

mapview(overall_predictions_map_data, zcol = "predicted_permits", at = c(0, 3, 10, 50, 100, 150, 300),
        popup=popupTable(overall_predictions_map_data, zcol = c("cell_id","predicted_permits"),
                         feature.id =FALSE),legend=TRUE, 
        layer.name = "Predicted Permitted Properties")

… as well as the actual permitted property counts …

mapview(overall_predictions_map_data, zcol = "actual_permits", at = c(0, 3, 10, 50, 100, 150, 300),
        popup=popupTable(overall_predictions_map_data, zcol = c("cell_id","actual_permits"),
                         feature.id =FALSE),legend=TRUE, 
        layer.name = "Actual Permitted Properties")

… and the pattern of errors.

mapview(overall_predictions_map_data, zcol = "error", 
        at = c(-60, -25, -5, 5, 25, 60), 
        col.regions = list("#e66101", "#fdb863", "#f7f7f7", "#b2abd2", "#5e3c99"), legend = TRUE, layer.name="Error")

Discussion

Our model was able to successfully predict spatial patterns of new development in Philadelphia at the small-area level, with a large majority of cells having little to no error. The model therefore provides well-calibrated estimates of development activity, useful for guiding strategic decisions for small businesses seeking the best locations to target.
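The calibration claim can be probed by binning predicted probabilities and comparing them to observed permit rates; a sketch on the test predictions:

```r
# Sketch: reliability check — binned mean predicted probability vs observed rate
test_predictions %>% 
  mutate(prob_bin = cut(permit_probability, breaks = seq(0, 1, by = 0.1),
                        include.lowest = TRUE)) %>% 
  group_by(prob_bin) %>% 
  summarize(mean_predicted = mean(permit_probability),
            observed_rate = mean(permit_2019_2024 == "TRUE"),
            n = n()) %>% 
  ggplot(aes(x = mean_predicted, y = observed_rate)) +
  geom_abline(linetype = "dashed") +   # perfect calibration line
  geom_point(aes(size = n), color = "#5e3c99") +
  coord_equal(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(title = "Calibration of predicted permit probabilities (test set)",
       x = "Mean predicted probability in bin",
       y = "Observed permit rate in bin")
```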

However, some areas show larger errors, especially those with higher overall development activity. While this may not necessarily matter in practice for our use case (for example, if a cell had 130 actual permitted properties, even an estimate with a large error, like 160, may still be helpful), the model could be further improved by including more relevant predictors (for example, recent zoning changes in nearby parcels), or by using a modeling approach that is less sensitive to class imbalance than logistic regression.
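Because the tidymodels workflow separates the model specification from the recipe, trying such an alternative requires only small changes. A sketch swapping in a random forest with downsampling of the majority class, where the themis package (for `step_downsample()`) and the ranger engine are added assumptions, not part of this project's loaded libraries:

```r
# Sketch: alternative specification less sensitive to class imbalance.
# themis and ranger are assumptions, not loaded elsewhere in this project.
library(themis)

rf_engine <- rand_forest(trees = 500) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

recipe_balanced <- recipe_full %>% 
  step_downsample(permit_2019_2024)   # balance classes during training only

rf_workflow <- workflow() %>% 
  add_model(rf_engine) %>% 
  add_recipe(recipe_balanced)

# rf_fit <- fit(rf_workflow, modelling_train)
```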

Further improvements to the model could include indicators more specifically targeted to commercial activity, as well as delivering related information (like the number of new businesses opening) to our end user. Finally, a more sensitive model would allow us to provide predictions at finer time scales, such as yearly estimates, rather than pooling 5 years’ worth of data.